

<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="Docutils 0.19: https://docutils.sourceforge.io/" />

  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  
  <title>ECBackend 实现策略 &mdash; Ceph Documentation</title>
  

  
  <link rel="stylesheet" href="../../../../_static/ceph.css" type="text/css" />
  <link rel="stylesheet" href="../../../../_static/pygments.css" type="text/css" />
  <link rel="stylesheet" href="../../../../_static/pygments.css" type="text/css" />
  <link rel="stylesheet" href="../../../../_static/ceph.css" type="text/css" />
  <link rel="stylesheet" href="../../../../_static/graphviz.css" type="text/css" />
  <link rel="stylesheet" href="../../../../_static/css/custom.css" type="text/css" />

  
  

  
  

  

  
  <!--[if lt IE 9]>
    <script src="../../../../_static/js/html5shiv.min.js"></script>
  <![endif]-->
  
    
      <script type="text/javascript" id="documentation_options" data-url_root="../../../../" src="../../../../_static/documentation_options.js"></script>
        <script src="../../../../_static/jquery.js"></script>
        <script src="../../../../_static/_sphinx_javascript_frameworks_compat.js"></script>
        <script data-url_root="../../../../" id="documentation_options" src="../../../../_static/documentation_options.js"></script>
        <script src="../../../../_static/doctools.js"></script>
        <script src="../../../../_static/sphinx_highlight.js"></script>
    
    <script type="text/javascript" src="../../../../_static/js/theme.js"></script>

    
    <link rel="index" title="Index" href="../../../../genindex/" />
    <link rel="search" title="Search" href="../../../../search/" />
    <link rel="next" title="Erasure coding enhancements" href="../enhancements/" />
    <link rel="prev" title="jerasure 插件" href="../jerasure/" /> 
</head>

<body class="wy-body-for-nav">

   
  <header class="top-bar">
    <div role="navigation" aria-label="Page navigation">
  <ul class="wy-breadcrumbs">
      <li><a href="../../../../" class="icon icon-home" aria-label="Home"></a></li>
          <li class="breadcrumb-item"><a href="../../../internals/">Ceph 内幕</a></li>
          <li class="breadcrumb-item"><a href="../../">OSD 开发者文档</a></li>
          <li class="breadcrumb-item"><a href="../">纠删码编码的归置组</a></li>
      <li class="breadcrumb-item active">ECBackend 实现策略</li>
      <li class="wy-breadcrumbs-aside">
            <a href="../../../../_sources/dev/osd_internals/erasure_coding/ecbackend.rst.txt" rel="nofollow"> View page source</a>
      </li>
  </ul>
  <hr/>
</div>
  </header>
  <div class="wy-grid-for-nav">
    
    <nav data-toggle="wy-nav-shift" class="wy-nav-side">
      <div class="wy-side-scroll">
        <div class="wy-side-nav-search"  style="background: #eee" >
          

          
            <a href="../../../../" class="icon icon-home"> Ceph
          

          
          </a>

          

          
<div role="search">
  <form id="rtd-search-form" class="wy-form" action="../../../../search/" method="get">
    <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
    <input type="hidden" name="check_keywords" value="yes" />
    <input type="hidden" name="area" value="default" />
  </form>
</div>

          
        </div>

        
        <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
          
            
            
              
            
            
              <ul class="current">
<li class="toctree-l1"><a class="reference internal" href="../../../../start/">Ceph 简介</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../install/">安装 Ceph</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../cephadm/">Cephadm</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../rados/">Ceph 存储集群</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../cephfs/">Ceph 文件系统</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../rbd/">Ceph 块设备</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../radosgw/">Ceph 对象网关</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../mgr/">Ceph 管理器守护进程</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../mgr/dashboard/">Ceph 仪表盘</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../monitoring/">监控概览</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../api/">API 文档</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../architecture/">体系结构</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../developer_guide/">开发者指南</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="../../../internals/">Ceph 内幕</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="../../../balancer-design/">Ceph 如何均衡（读写、容量）</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../blkin/">Tracing Ceph With LTTng</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../blkin/#tracing-ceph-with-blkin">Tracing Ceph With Blkin</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../bluestore/">BlueStore Internals</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../ceph_krb_auth/">如何配置好 Ceph Kerberos 认证的详细文档</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../cephfs-mirroring/">CephFS Mirroring</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../cephfs-reclaim/">CephFS Reclaim Interface</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../cephfs-snapshots/">CephFS 快照</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../cephx/">Cephx</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../cephx_protocol/">Cephx 认证协议详细阐述</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../config/">配置管理系统</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../config-key/">config-key layout</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../context/">CephContext</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../continuous-integration/">Continuous Integration Architecture</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../corpus/">资料库结构</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../cpu-profiler/">Oprofile 的安装</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../crush-msr/">CRUSH MSR (Multi-step Retry)</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../cxx/">C++17 and libstdc++ ABI</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../deduplication/">去重</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../delayed-delete/">CephFS delayed deletion</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../dev_cluster_deployment/">开发集群的部署</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../dev_cluster_deployment/#id5">在同一机器上部署多套开发集群</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../development-workflow/">开发流程</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../documenting/">为 Ceph 写作文档</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../dpdk/">Ceph messenger DPDKStack</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../encoding/">序列化（编码、解码）</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../erasure-coded-pool/">纠删码存储池</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../file-striping/">File striping</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../freebsd/">FreeBSD Implementation details</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../generatedocs/">Ceph 文档的构建</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../health-reports/">Health Reports</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../iana/">IANA 号</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../kclient/">Testing changes to the Linux Kernel CephFS driver</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../kclient/#step-one-build-the-kernel">Step One: build the kernel</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../kclient/#step-two-create-a-vm">Step Two: create a VM</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../kclient/#step-three-networking-the-vm">Step Three: Networking the VM</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../kubernetes/">Hacking on Ceph in Kubernetes with Rook</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../libcephfs_proxy/">Design of the libcephfs proxy</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../libs/">库体系结构</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../logging/">集群日志的用法</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../logs/">调试日志</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../macos/">在 MacOS 上构建</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../mempool_accounting/">What is a mempool?</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../mempool_accounting/#some-common-mempools-that-we-can-track">Some common mempools that we can track</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../messenger/">Messenger notes</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../mon-bootstrap/">Monitor bootstrap</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../mon-elections/">Monitor Elections</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../mon-on-disk-formats/">ON-DISK FORMAT</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../mon-osdmap-prune/">FULL OSDMAP VERSION PRUNING</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../msgr2/">msgr2 协议（ msgr2.0 和 msgr2.1 ）</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../network-encoding/">Network Encoding</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../network-protocol/">网络协议</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../object-store/">对象存储架构概述</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../osd-class-path/">OSD class path issues</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../peering/">互联</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../perf/">Using perf</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../perf_counters/">性能计数器</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../perf_histograms/">Perf histograms</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../placement-group/">PG （归置组）说明</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../quick_guide/">开发者指南（快速）</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../rados-client-protocol/">RADOS 客户端协议</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../rbd-diff/">RBD 增量备份</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../rbd-export/">RBD Export &amp; Import</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../rbd-layering/">RBD Layering</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../release-checklists/">Release checklists</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../release-process/">Ceph Release Process</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../seastore/">SeaStore</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../sepia/">Sepia 社区测试实验室</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../session_authentication/">Session Authentication for the Cephx Protocol</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../testing/">测试笔记</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../versions/">Public OSD Version</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../vstart-ganesha/">NFS CephFS-RGW Developer Guide</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../wireshark/">Wireshark Dissector</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../zoned-storage/">Zoned Storage Support</a></li>
<li class="toctree-l2 current"><a class="reference internal" href="../../">OSD 开发者文档</a><ul class="current">
<li class="toctree-l3"><a class="reference internal" href="../../async_recovery/">异步恢复</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../backfill_reservation/">Backfill Reservation</a></li>
<li class="toctree-l3 current"><a class="reference internal" href="../">纠删码编码的归置组</a><ul class="current">
<li class="toctree-l4"><a class="reference internal" href="../#id2">术语</a></li>
<li class="toctree-l4 current"><a class="reference internal" href="../#id3">内容列表</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="../../last_epoch_started/">last_epoch_started</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../log_based_pg/">Log Based PG</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../manifest/">Manifest</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../map_message_handling/">Map and PG Message handling</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../mclock_wpq_cmp_study/">QoS Study with mClock and WPQ Schedulers</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../osd_overview/">OSD</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../partial_object_recovery/">Partial Object Recovery</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../past_intervals/">OSDMap Trimming and PastIntervals</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../pg/">PG</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../pg_removal/">PG Removal</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../pgpool/">PGPool</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../recovery_reservation/">Recovery Reservation</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../refcount/">Refcount</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../scrub/">Scrub internals and diagnostics</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../snaps/">快照</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../stale_read/">Preventing Stale Reads</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../watch_notify/">关注通知</a></li>
<li class="toctree-l3"><a class="reference internal" href="../../wbthrottle/">回写抑制</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../../../mds_internals/">MDS 开发者文档</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../radosgw/">RADOS 网关开发者文档</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../ceph-volume/">ceph-volume 开发者文档</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../../crimson/">Crimson developer documentation</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../../../../governance/">项目管理</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../foundation/">Ceph 基金会</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../ceph-volume/">ceph-volume</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../releases/general/">Ceph 版本（总目录）</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../releases/">Ceph 版本（索引）</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../security/">Security</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../hardware-monitoring/">硬件监控</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../glossary/">Ceph 术语</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../jaegertracing/">Tracing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../../translation_cn/">中文版翻译资源</a></li>
</ul>

            
          
        </div>
        
      </div>
    </nav>

    <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">

      
      <nav class="wy-nav-top" aria-label="top navigation">
        
          <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
          <a href="../../../../">Ceph</a>
        
      </nav>


      <div class="wy-nav-content">
        
        <div class="rst-content">
        
          <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
           <div itemprop="articleBody">
            
<div id="dev-warning" class="admonition note">
  <p class="first admonition-title">Notice</p>
  <p class="last">This document is for a development version of Ceph.</p>
</div>
  <div id="docubetter" align="right" style="padding: 5px; font-weight: bold;">
    <a href="https://pad.ceph.com/p/Report_Documentation_Bugs">Report a Documentation Bug</a>
  </div>

  
  <section id="ecbackend">
<h1>ECBackend 实现策略<a class="headerlink" href="#ecbackend" title="Permalink to this heading"></a></h1>
<section id="id1">
<h2>设计初稿的各种起因<a class="headerlink" href="#id1" title="Permalink to this heading"></a></h2>
<p>The initial (and still true for EC pools without the hacky EC
overwrites debug flag enabled) design for EC pools restricted
EC pools to operations that can be easily rolled back:</p>
<ul class="simple">
<li><p>CEPH_OSD_OP_APPEND: We can roll back an append locally by
including the previous object size as part of the PG log event.</p></li>
<li><p>CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
requires that we retain the deleted object until all replicas have
persisted in the deletion event. Erasure Coded backend will therefore
need to store objects with the version at which they were created
included in the key provided to the filestore.  Old versions of an
object can be pruned when all replicas have committed up to the log
event deleting the object.</p></li>
<li><p>CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
to be set or removed, we can roll back these operations locally.</p></li>
</ul>
<p>Log entries contain a structure explaining how to locally undo the
operation represented by the operation
(see osd_types.h:TransactionInfo::LocalRollBack).</p>
<section id="pgtemp-and-crush">
<h3>PGTemp and Crush<a class="headerlink" href="#pgtemp-and-crush" title="Permalink to this heading"></a></h3>
<p>Primaries are able to request a temp acting set mapping in order to
allow an up-to-date OSD to serve requests while a new primary is
backfilled (and for other reasons).  An erasure coded PG needs to be
able to designate a primary for these reasons without putting it in
the first position of the acting set.  It also needs to be able to
leave holes in the requested acting set.</p>
<p>Core Changes:</p>
<ul class="simple">
<li><p>OSDMap::pg_to_*_osds needs to separately return a primary.  For most
cases, this can continue to be acting[0].</p></li>
<li><p>MOSDPGTemp (and related OSD structures) needs to be able to specify
a primary as well as an acting set.</p></li>
<li><p>Much of the existing code base assumes that acting[0] is the primary
and that all elements of acting are valid.  This needs to be cleaned
up since the acting set may contain holes.</p></li>
</ul>
</section>
<section id="distinguished-acting-set-positions">
<h3>Distinguished acting set positions<a class="headerlink" href="#distinguished-acting-set-positions" title="Permalink to this heading"></a></h3>
<p>With the replicated strategy, all replicas of a PG are
interchangeable.  With erasure coding, different positions in the
acting set have different pieces of the erasure coding scheme and are
not interchangeable.  Worse, crush might cause chunk 2 to be written
to an OSD which happens already to contain an (old) copy of chunk 4.
This means that the OSD and PG messages need to work in terms of a
type like pair&lt;shard_t, pg_t&gt; in order to distinguish different PG
chunks on a single OSD.</p>
<p>Because the mapping of an object name to object in the filestore must
be 1-to-1, we must ensure that the objects in chunk 2 and the objects
in chunk 4 have different names.  To that end, the object store must
include the chunk id in the object key.</p>
<p>Core changes:</p>
<ul class="simple">
<li><p>The object store <a class="reference external" href="https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241">ghobject_t needs to also include a chunk id</a> making it more like
tuple&lt;hobject_t, gen_t, shard_t&gt;.</p></li>
<li><p>coll_t needs to include a shard_t.</p></li>
<li><p>The OSD pg_map and similar PG mappings need to work in terms of a
spg_t (essentially
pair&lt;pg_t, shard_t&gt;).  Similarly, pg-&gt;pg messages need to include
a shard_t</p></li>
<li><p>For client-&gt;PG messages, the OSD will need a way to know which PG
chunk should get the message since the OSD may contain both a
primary and non-primary chunk for the same PG</p></li>
</ul>
</section>
<section id="object-classes">
<h3>Object Classes<a class="headerlink" href="#object-classes" title="Permalink to this heading"></a></h3>
<p>Reads from object classes will return ENOTSUP on EC pools by invoking
a special SYNC read.</p>
</section>
<section id="scrub">
<h3>Scrub<a class="headerlink" href="#scrub" title="Permalink to this heading"></a></h3>
<p>The main catch, however, for EC pools is that sending a crc32 of the
stored chunk on a replica isn’t particularly helpful since the chunks
on different replicas presumably store different data.  Because we
don’t support overwrites except via DELETE, however, we have the
option of maintaining a crc32 on each chunk through each append.
Thus, each replica instead simply computes a crc32 of its own stored
chunk and compares it with the locally stored checksum.  The replica
then reports to the primary whether the checksums match.</p>
<p>With overwrites, all scrubs are disabled for now until we work out
what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst).</p>
</section>
<section id="crush">
<h3>Crush<a class="headerlink" href="#crush" title="Permalink to this heading"></a></h3>
<p>If crush is unable to generate a replacement for a down member of an
acting set, the acting set should have a hole at that position rather
than shifting the other elements of the acting set out of position.</p>
</section>
</section>
</section>
<section id="id2">
<h1>ECBackend<a class="headerlink" href="#id2" title="Permalink to this heading"></a></h1>
<section id="main-operation-overview">
<h2>MAIN OPERATION OVERVIEW<a class="headerlink" href="#main-operation-overview" title="Permalink to this heading"></a></h2>
<p>A RADOS put operation can span
multiple stripes of a single object. There must be code that
tessellates the application level write into a set of per-stripe write
operations -- some whole-stripes and up to two partial
stripes. Without loss of generality, for the remainder of this
document, we will focus exclusively on writing a single stripe (whole
or partial). We will use the symbol “W” to represent the number of
blocks within a stripe that are being written, i.e., W &lt;= K.</p>
<p>There are three data flows for handling a write into an EC stripe. The
choice of which of the three data flows to choose is based on the size
of the write operation and the arithmetic properties of the selected
parity-generation algorithm.</p>
<ol class="arabic simple">
<li><p>Whole stripe is written/overwritten</p></li>
<li><p>A read-modify-write operation is performed.</p></li>
</ol>
<section id="whole-stripe-write">
<h3>WHOLE STRIPE WRITE<a class="headerlink" href="#whole-stripe-write" title="Permalink to this heading"></a></h3>
<p>This is a simple case, and is already performed in the existing code
(for appends, that is). The primary receives all of the data for the
stripe in the RADOS request, computes the appropriate parity blocks
and send the data and parity blocks to their destination shards which
write them. This is essentially the current EC code.</p>
</section>
<section id="read-modify-write">
<h3>READ-MODIFY-WRITE<a class="headerlink" href="#read-modify-write" title="Permalink to this heading"></a></h3>
<p>The primary determines which of the K-W blocks are to be unmodified,
and reads them from the shards. Once all of the data is received it is
combined with the received new data and new parity blocks are
computed. The modified blocks are sent to their respective shards and
written. The RADOS operation is acknowledged.</p>
</section>
<section id="osd-object-write-and-consistency">
<h3>OSD Object Write and Consistency<a class="headerlink" href="#osd-object-write-and-consistency" title="Permalink to this heading"></a></h3>
<p>Regardless of the algorithm chosen above, writing of the data is a two-
phase process: commit and rollforward. The primary sends the log
entries with the operation described (see
osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack).
In all cases, the “commit” is performed in place, possibly leaving some
information required for a rollback in a write-aside object.  The
rollforward phase occurs once all acting set replicas have committed
the commit, it then removes the rollback information.</p>
<p>In the case of overwrites of existing stripes, the rollback information
has the form of a sparse object containing the old values of the
overwritten extents populated using clone_range.  This is essentially
a place-holder implementation, in real life, bluestore will have an
efficient primitive for this.</p>
<p>The rollforward part can be delayed since we report the operation as
committed once all replicas have been committed.  Currently, whenever we
send a write, we also indicate that all previously committed
operations should be rolled forward (see
ECBackend::try_reads_to_commit).  If there aren’t any in the pipeline
when we arrive at the waiting_rollforward queue, we start a dummy
write to move things along (see the Pipeline section later on and
ECBackend::try_finish_rmw).</p>
</section>
<section id="extentcache">
<h3>ExtentCache<a class="headerlink" href="#extentcache" title="Permalink to this heading"></a></h3>
<p>It’s pretty important to be able to pipeline writes on the same
object.  For this reason, there is a cache of extents written by
cacheable operations.  Each extent remains pinned until the operations
referring to it are committed.  The pipeline prevents rmw operations
from running until uncacheable transactions (clones, etc) are flushed
from the pipeline.</p>
<p>See ExtentCache.h for a detailed explanation of how the cache
states correspond to the higher level invariants about the conditions
under which concurrent operations can refer to the same object.</p>
</section>
<section id="pipeline">
<h3>Pipeline<a class="headerlink" href="#pipeline" title="Permalink to this heading"></a></h3>
<p>Reading src/osd/ExtentCache.h should have given a good idea of how
operations might overlap.  There are several states involved in
processing a write operation and an important invariant which
isn’t enforced by PrimaryLogPG at a higher level which needs to be
managed by ECBackend.  The important invariant is that we can’t
have uncacheable and rmw operations running at the same time
on the same object.  For simplicity, we simply enforce that any
operation which contains an rmw operation must wait until
all in-progress uncacheable operations complete.</p>
<p>There are improvements to be made here in the future.</p>
<p>For more details, see ECBackend::waiting_* and
ECBackend::try_&lt;from&gt;_to_&lt;to&gt;.</p>
</section>
</section>
</section>



<div id="support-the-ceph-foundation" class="admonition note">
  <p class="first admonition-title">Brought to you by the Ceph Foundation</p>
  <p class="last">The Ceph Documentation is a community resource funded and hosted by the non-profit <a href="https://ceph.io/en/foundation/">Ceph Foundation</a>. If you would like to support this and our other efforts, please consider <a href="https://ceph.io/en/foundation/join/">joining now</a>.</p>
</div>


           </div>
           
          </div>
          <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
        <a href="../jerasure/" class="btn btn-neutral float-left" title="jerasure 插件" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
        <a href="../enhancements/" class="btn btn-neutral float-right" title="Erasure coding enhancements" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
    </div>

  <hr/>

  <div role="contentinfo">
    <p>&#169; Copyright 2016, Ceph authors and contributors. Licensed under Creative Commons Attribution Share Alike 3.0 (CC-BY-SA-3.0).</p>
  </div>

   

</footer>
        </div>
      </div>

    </section>

  </div>
  

  <script type="text/javascript">
      jQuery(function () {
          SphinxRtdTheme.Navigation.enable(true);
      });
  </script>

  
  
    
   

</body>
</html>